Santali Raw Speech Corpus
Overview
50:58:13 Hours | 15.08 GB | 60 Speakers | 11844 Audio Segments
Dataset Description
The Santali Raw Speech Corpus was recorded using the SDCP (Speech Data Collection Portal) mobile application developed by the Linguistic Data Consortium for Indian Languages (LDC-IL). It consists of recordings of native Santali speakers from various parts of the state of Jharkhand. Each content category contains multiple audio recordings, and each recording features different material. Speakers from different age groups recorded read speech consisting of literary excerpts and sentences. The sentences were taken from the Parallel Corpus developed by LDC-IL.
A minimum of 2,000 words of read speech was recorded from each speaker: approximately 1,000 words from sentences (Parallel Corpus) and 1,000 words from literary texts. The data also includes spontaneous speech recordings prompted by topics or questions about day-to-day life across different domains.
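As a rough illustration, the headline figures in this entry can be turned into per-speaker and per-segment averages. The sketch below assumes that "50:58:13 Hours" is in hours:minutes:seconds format, and the function names are hypothetical rather than part of any LDC-IL tooling. The results are simple means over the whole corpus (read and spontaneous speech together) and say nothing about how duration is actually distributed across speakers or segments.

```python
# Headline figures copied from this catalogue entry.
DURATION_HMS = "50:58:13"            # assumed to mean hours:minutes:seconds
SPEAKERS = 60
SEGMENTS = 11844
MIN_READ_WORDS_PER_SPEAKER = 2000    # stated minimum of read speech per speaker


def hms_to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' string to a whole number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def summarize() -> dict:
    """Derive corpus-wide averages from the headline figures."""
    total_seconds = hms_to_seconds(DURATION_HMS)
    return {
        "total_seconds": total_seconds,
        "mean_minutes_per_speaker": total_seconds / SPEAKERS / 60,
        "mean_seconds_per_segment": total_seconds / SEGMENTS,
        # Lower bound only: each speaker read at least this many words.
        "min_read_words_total": SPEAKERS * MIN_READ_WORDS_PER_SPEAKER,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:,.2f}")
```

Under these assumptions, the averages come to roughly 51 minutes of audio per speaker and about 15.5 seconds per segment, with at least 120,000 words of read speech across the corpus.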
When citing this corpus in research, please use the following references:
1. Dharmenjay Hembram, Ankita Tiwari, Dr. Narayan Kumar Choudhary, Prof. Shailendra Mohan. 2026. Santali Raw Speech Corpus. Central Institute of Indian Languages, Mysore. ISBN 978-81-69175-04-3.
2. Narayan Choudhary, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
3. Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1. doi: https://doi.org/10.1007/s10579-020-09523-3.
Item specifics
- Authors: Dharmenjay Hembram, Ankita Tiwari, Dr. Narayan Choudhary, Prof. Shailendra Mohan
- Corpus Type: Raw Speech Corpus
- Catalogue Number: 1689
- ISBN: 978-81-69175-04-3
- Data Source:
- Duration: 50:58:13 hours
- Release Date: 23/3/2026
- Terms and Conditions: General instructions for use of the resources provided by LDC-IL.
